Foundation of Mining Class-Imbalanced Data

نویسندگان

  • Da Kuang
  • Charles X. Ling
  • Jun Du
چکیده

Mining class-imbalanced data is a common yet challenging problem in data mining and machine learning. When the class is imbalanced, the error rate of the rare class is usually much higher than that of the majority class. How many samples do we need in order to bound the error of the rare class (and the majority class)? If the misclassification cost of the class is known, can the costweighted error be bounded as well? In this paper, we attempt to answer those questions with PAC-learning. We derive several upper bounds on the sample size that guarantee the error on a particular class (the rare and majority class) and the cost-weighted error, with the consistent and agnostic learners. Similar to the upper bounds in traditional PAC learning, our upper bounds are quite loose. In order to make them more practical, we empirically study the pattern observed in our upper bounds. From the empirical results we obtain some interesting implications for data mining in real-world applications. As far as we know, this is the first work providing theoretical bounds and the corresponding practical implications for mining class-imbalanced data with unequal cost.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

On Mining Fuzzy Classification Rules for Imbalanced Data

Fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it on imbalanced data sets is its biased to the majority class, such that, it performs poorly in respect to the minority class. However many cases the minority classes are more important than the majority ones. In this paper, we have extended ...

متن کامل

On Mining Fuzzy Classification Rules for Imbalanced Data

Fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it on imbalanced data sets is its biased to the majority class, such that, it performs poorly in respect to the minority class. However many cases the minority classes are more important than the majority ones. In this paper, we have extended ...

متن کامل

Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

 Classification is an one of the important parts of data mining and knowledge discovery. In most cases, the data that is utilized to used to training the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples but while the number of other class samples is naturally inherently low. In general, the methods of solving this kind of prob...

متن کامل

Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...

متن کامل

Data mining with imbalanced class distributions: concepts and methods

Some real world data mining applications present imbalanced or skewed class distributions. In these domains, the underrepresented classes are often the ones we are more interested in. However, most learning algorithms are not able to induce meaningful classifiers in some imbalanced domains. One reason for this poor performance is that learning algorithms tend to focus in abundant classes to max...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012